Support INT8 Weight-Only Quantization #263
Conversation
Force-pushed from 4f99116 to aa960ea (compare)
Codecov Report
✅ All modified and coverable lines are covered by tests.

    @@           Coverage Diff           @@
    ##             main     #263   +/-   ##
    =======================================
      Coverage   73.76%   73.76%
    =======================================
      Files         171      171
      Lines       17618    17619    +1
    =======================================
    + Hits        12996    12997    +1
      Misses       4622     4622

☔ View full report in Codecov by Sentry.
Force-pushed from aa960ea to d6f8908 (compare)
Force-pushed from d6f8908 to d989313 (compare)
Walkthrough
Adds a weight-only INT8 quantization option ("int8_wo") across configs, utilities, examples, scripts, and tests; updates the per-layer quantization decision to choose SQ vs. WO based on whether the input quantizer is present and enabled; fixes a documentation typo.
Sequence Diagram(s)

    sequenceDiagram
        autonumber
        actor User
        participant CLI as Script/CLI
        participant HF_PTQ as examples/llm_ptq/hf_ptq
        participant Quant as modelopt/torch/export/quant_utils
        participant Export as modelopt/torch/export
        User->>CLI: invoke with QFORMAT=int8_wo
        CLI->>HF_PTQ: validate args (accept int8_wo)
        HF_PTQ->>Quant: request per-layer quant decision
        Note right of Quant #D3E4CD: Decision for 8-bit weights:\nif input_quantizer present & enabled -> INT8_SQ\nelse -> INT8_WO
        Quant->>Export: to_quantized_weight (INT8_SQ / INT8_WO)
        Export-->>HF_PTQ: return quantized artifacts
        HF_PTQ-->>User: produce HF export artifacts (int8_wo)
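For readers skimming the diagram, here is a minimal runnable sketch of the per-layer decision described in the note above. The function name and stand-in modules are illustrative assumptions, not the actual quant_utils.py implementation.

```python
# Illustrative sketch of the 8-bit per-layer decision described above
# (names are hypothetical; the real logic lives in modelopt/torch/export/quant_utils.py).
from types import SimpleNamespace

QUANTIZATION_INT8_SQ = "int8_sq"
QUANTIZATION_INT8_WO = "int8_wo"


def decide_int8_quantization(module) -> str:
    """Pick SmoothQuant (W8A8) vs. weight-only (W8A16) for a layer with 8-bit weights."""
    input_quantizer = getattr(module, "input_quantizer", None)
    if input_quantizer is not None and input_quantizer.is_enabled:
        # Activations are quantized too -> SmoothQuant.
        return QUANTIZATION_INT8_SQ
    # No (enabled) input quantizer -> weight-only.
    return QUANTIZATION_INT8_WO


# Tiny demo with stand-in modules.
sq_layer = SimpleNamespace(input_quantizer=SimpleNamespace(is_enabled=True))
wo_layer = SimpleNamespace(input_quantizer=SimpleNamespace(is_enabled=False))
print(decide_int8_quantization(sq_layer))  # int8_sq
print(decide_int8_quantization(wo_layer))  # int8_wo
```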
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Force-pushed from d989313 to 12d955e (compare)
Actionable comments posted: 5
🧹 Nitpick comments (2)
modelopt/torch/quantization/config.py (1)
- Lines 634-635: Remember to surface this in user-facing docs/tables. Please add "INT8 Weight-only (W8A16)" to the "Quantization Formats" table to avoid discoverability gaps. I can send a doc patch if you want.
modelopt/torch/export/quant_utils.py (1)
- Lines 92-122: Unused helper (maybe_transpose_expert_weight_dimensions). Defined but not used. Either wire it into the packing paths that need it or drop it to avoid dead code.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/source/guides/_compress_quantized_models.rst (1 hunk)
- examples/llm_ptq/hf_ptq.py (3 hunks)
- examples/llm_ptq/scripts/huggingface_example.sh (2 hunks)
- modelopt/torch/export/model_config.py (2 hunks)
- modelopt/torch/export/quant_utils.py (5 hunks)
- modelopt/torch/quantization/config.py (2 hunks)
- tests/examples/llm_ptq/test_llm_ptq.py (1 hunk)
- tests/gpu/torch/export/test_export.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (12)
docs/source/guides/_compress_quantized_models.rst (1)
- Line 35: Typo/clarity fix looks good. Reads well and aligns with the section intent.

modelopt/torch/export/model_config.py (2)
- Line 32: Expose INT8 weight-only enum — OK. Constant addition is consistent with other quant identifiers.
- Line 205: Comment update matches behavior. The concat path now rightly mentions INT8 WO alongside SQ/AWQ/NVFP4.

modelopt/torch/quantization/config.py (1)
- Lines 181-188: Config definition for INT8 weight-only — OK. Per-channel weights-only with axis=0 and inputs disabled; sensible default "max". (A rough sketch of such a config follows this list.)

tests/examples/llm_ptq/test_llm_ptq.py (1)
- Line 98: Test matrix expanded to int8_wo (HF) — good coverage bump.

tests/gpu/torch/export/test_export.py (2)
- Line 60: Import of INT8_WEIGHT_ONLY_CFG — OK.
- Lines 327-328: Block size expectation for INT8 WO = 0 — sane.

examples/llm_ptq/hf_ptq.py (2)
- Line 65: QUANT_CFG_CHOICES: added int8_wo — OK.
- Lines 95-106: HF auto-quant allowlist includes int8_wo — OK. Keeps "int8" out of the auto path, as intended.

modelopt/torch/export/quant_utils.py (3)
- Lines 53-54: New enum import — OK.
- Lines 458-462: INT8 detection split (SQ vs WO) — OK. Uses input_quantizer.is_enabled to distinguish; matches config semantics.
- Lines 633-635: Per-layer config mapping for int8_wo → W8A16 — OK.
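To make the config.py comment above concrete, here is a rough sketch of what a per-channel, weights-only INT8 config could look like, assuming the usual modelopt quant_cfg dictionary layout; the exact contents of the new INT8_WEIGHT_ONLY_CFG may differ.

```python
# Illustrative sketch only -- not the exact INT8_WEIGHT_ONLY_CFG added in this PR.
# Assumes the common modelopt config shape: per-channel (axis=0) 8-bit weights,
# input quantizers disabled (weight-only), and "max" calibration.
INT8_WEIGHT_ONLY_CFG_SKETCH = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},  # per-output-channel weight scales
        "*input_quantizer": {"enable": False},            # no activation quantization
        "default": {"enable": False},
    },
    "algorithm": "max",
}
```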
examples/llm_ptq/hf_ptq.py (outdated)

        "int8",
        "int8_wo",
        "int4_awq",
💡 Verification agent
🧩 Analysis chain
Mismatch with huggingface_example.sh allowlists.
HF path here allows “int8” and “int8_wo”, but the script denies “int8”, causing CLI rejection before Python runs. Update the script allowlists accordingly (see suggested patch in the script file comment).
🏁 Script executed:

    #!/bin/bash
    set -e
    # Locate the huggingface example script
    script_path=$(find . -type f -name huggingface_example.sh | head -n1)
    echo "Found script at: $script_path"
    # Show lines around allowlist/quantization options
    echo "=== Context around allowlisted quant types ==="
    grep -nE 'allowlist|quant' -C3 "$script_path"
    # Show any explicit int8/int8_wo mentions
    echo "=== int8 / int8_wo occurrences ==="
    grep -nE 'int8|int8_wo' -n "$script_path"
    # Show validation logic (case statements) for quant types
    echo "=== Case statements for quant type validation ==="
    grep -nR -n 'case .* in' -C2 "$script_path"

Length of output: 3127
Add “int8” to quantization allowlist in huggingface_example.sh
In examples/llm_ptq/scripts/huggingface_example.sh, update both quant-format case statements (around lines 48 and 77) to include “int8” alongside “int8_wo” and adjust the corresponding error-message lists so that “int8” is accepted.
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around the quant-format case
statements near lines ~48 and ~77, the script currently accepts "int8_wo" (and
other formats) but does not include "int8" in the allowed patterns or the
corresponding error-message lists; update both case branches to include "int8"
alongside "int8_wo" (e.g., add "int8" as an accepted pattern in each case
pattern list) and update the error/usage messages so that they list "int8" as an
accepted format where applicable.
      fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
      *)
    -   echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
    +   echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
        exit 1
HF allowlist should include ‘int8’ to match hf_ptq.py.
Without this, “--export_fmt hf --qformat int8” is blocked by the script. Add “int8” here.
Apply:
- fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
+ fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8 | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested replacement for the hunk above:

    fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8 | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
    *)
      echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
      exit 1
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around lines 48 to 51, the
quantization allowlist is missing the "int8" option which prevents "--export_fmt
hf --qformat int8" from being accepted; add "int8" to the case pattern list (and
update the echo message to include "int8" among the expected values) so the
script accepts that quant argument and exits successfully for valid inputs.
      fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
      *)
    -   echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
    +   echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
        exit 1
🛠️ Refactor suggestion
General quant allowlist also needs ‘int8’ or guard by export_fmt.
Currently this block runs for HF too and rejects “int8”. Minimal fix: add “int8”.
Apply:
- fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
+ fp8 | fp8_pc_pt | fp8_pb_wo | int8 | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
Alternative (cleaner): wrap this whole validation in if [ "$EXPORT_FORMAT" != "hf" ]; then ... fi so the HF path is validated only once.
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around lines 77-80, the quant
argument validation excludes "int8" and incorrectly rejects it for the HF path;
either add "int8" to the list of allowed quant values in the case pattern or,
preferably, wrap this entire validation block in a guard so it only runs when
EXPORT_FORMAT is not "hf" (i.e., surround the case...esac with if [
"$EXPORT_FORMAT" != "hf" ]; then ... fi), ensuring HF exports are not blocked
and that "int8" remains accepted where appropriate.
    if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
        return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
INT8 unpack path also misses 3D (MoE) weights.
Mirror the 3D case to restore full-precision weights correctly.
Apply:
    - if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
    -     return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
    + if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
    +     if weight.dim() == 3:
    +         return weight.to(torch_dtype) * weights_scaling_factor.unsqueeze(-1).to(torch_dtype)
    +     elif weight.dim() == 2:
    +         return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
    +     else:
    +         raise NotImplementedError("INT8 dequantization expects 2D or 3D weight tensors")
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested replacement for the hunk above:

    if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
        if weight.dim() == 3:
            return weight.to(torch_dtype) * weights_scaling_factor.unsqueeze(-1).to(torch_dtype)
        elif weight.dim() == 2:
            return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
        else:
            raise NotImplementedError("INT8 dequantization expects 2D or 3D weight tensors")
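As a sanity check on the scale broadcasting discussed above, the following is a small self-contained round-trip sketch (not the modelopt implementation) of per-channel INT8 weight-only quantize/dequantize for 2D and 3D (MoE-style) weights; it assumes an [num_experts, out_features, in_features] layout for the 3D case.

```python
import torch


def int8_wo_roundtrip(weight: torch.Tensor) -> torch.Tensor:
    """Per-channel INT8 weight-only quantize, then dequantize (illustrative sketch)."""
    # Reduce over the input dim (last dim): 2D -> [out], 3D -> [experts, out].
    amax = weight.abs().amax(dim=-1)
    scale = (amax / 127.0).clamp(min=1e-8)
    # Quantize: broadcast the per-channel scale over the input dim.
    qweight = torch.clamp(torch.round(weight / scale.unsqueeze(-1)), -128, 127).to(torch.int8)
    # Dequantize: same broadcast, mirroring weights_scaling_factor[:, None] / unsqueeze(-1) above.
    return qweight.to(weight.dtype) * scale.unsqueeze(-1).to(weight.dtype)


w2d = torch.randn(16, 32)      # [out_features, in_features]
w3d = torch.randn(4, 16, 32)   # [num_experts, out_features, in_features] (assumed layout)
print(torch.allclose(w2d, int8_wo_roundtrip(w2d), atol=0.03))  # True: round-trip error < scale/2
print(torch.allclose(w3d, int8_wo_roundtrip(w3d), atol=0.03))  # True
```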
Force-pushed from 12d955e to 1b8036e (compare)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)
- Line 98: Explicitly set KV-cache quantization to none for INT8 weight-only HF export. Aligns the test with the documented usage and avoids relying on defaults that might change.

    - PTQCommand(quant="int8_wo", export_fmt="hf"),
    + PTQCommand(quant="int8_wo", export_fmt="hf", kv_cache_quant="none"),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/source/guides/_compress_quantized_models.rst (1 hunk)
- examples/llm_ptq/hf_ptq.py (3 hunks)
- examples/llm_ptq/scripts/huggingface_example.sh (2 hunks)
- modelopt/torch/export/model_config.py (2 hunks)
- modelopt/torch/export/quant_utils.py (5 hunks)
- modelopt/torch/quantization/config.py (2 hunks)
- tests/examples/llm_ptq/test_llm_ptq.py (1 hunk)
- tests/gpu/torch/export/test_export.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- docs/source/guides/_compress_quantized_models.rst
🚧 Files skipped from review as they are similar to previous changes (6)
- modelopt/torch/quantization/config.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/model_config.py
- examples/llm_ptq/hf_ptq.py
- tests/gpu/torch/export/test_export.py
- modelopt/torch/export/quant_utils.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)
- tests/_test_utils/ptq_utils.py (1): PTQCommand (28-79)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
Thanks @Yuening-wa for adding int8-wo support. 👍
Do we know how the accuracy and perf look compared with int8-sq?
Thanks @Edwardf0t1 for the review. Here is the accuracy comparison against BF16 for the Qwen3-30B-A3B model on MMLU and GSM8K.
Force-pushed from d444ca9 to 4e07a62 (compare)
Overall LGTM. The benefit of int8wo vs int8_sq is that int8wo has TRTLLM torch backend support.
Before approval, could you add this mode into the following tests:
https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/tests/examples/llm_ptq/test_llm_ptq.py
https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
Signed-off-by: Yuening Li <[email protected]>
Force-pushed from 4e07a62 to 4ea627f (compare)
Thanks for the comments @cjluo-nv. Added the int8wo mode to these two tests.
Signed-off-by: Yuening Li <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
What does this PR do?
Type of change: new feature
Overview: Add support for INT8 weight-only per-channel quantization. The output INT8-quantized checkpoint is in HuggingFace format and can be used directly in the TRTLLM PyTorch workflow.
Usage
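As a rough illustration of the workflow described in the overview, the sketch below assumes the new config is exposed as mtq.INT8_WEIGHT_ONLY_CFG alongside the existing formats and that export_hf_checkpoint is the unified HF export entry point; the model name, export path, and one-shot calibration loop are placeholders, not the PR's documented usage.

```python
# Hypothetical end-to-end sketch (placeholders throughout; not the PR's exact usage).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_name = "facebook/opt-125m"  # placeholder; any HF causal LM
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def forward_loop(m):
    # Minimal stand-in calibration; a real run would iterate a calibration dataset.
    inputs = tokenizer("Hello world", return_tensors="pt").to(m.device)
    m(**inputs)


# Weight-only INT8: calibration only needs per-channel weight amax ("max" algorithm).
model = mtq.quantize(model, mtq.INT8_WEIGHT_ONLY_CFG, forward_loop)

# Export a HuggingFace-format checkpoint consumable by the TRTLLM PyTorch backend.
export_hf_checkpoint(model, export_dir="opt-125m-int8wo")
```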
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Examples
Library
Documentation
Tests